The TRAVERSAL code from old versions of Lynx has been upgraded by David
Mathog (mathog@seqaxp.bio.caltech.edu) so that it works again. It is now
enabled with a command line switch (-traversal) instead of a compilation
symbol that created a separate Lynx executable, as in those previous
versions, and it can be used in conjunction with a -crawl switch to make
Lynx a front end for a Web Crawler.

Usage:
    lynx [-traversal] [-crawl] ["startpage"]

Added switches are:
    -traversal   Follow all links whose URLs begin with the startpage.
                 If startpage isn't specified, the crawl begins with
                 the default start page.
    -crawl       With [-traversal], outputs each unique hypertext page
                 as an lnk########.dat file in the format specified
                 below. With [-dump], outputs only the startpage, in
                 the same format, to stdout.

Note on startpage:
    If a startpage is specified and contains any uppercase
    characters, on VMS it should be enclosed in double-quotes.
    The code that verifies that "startpage" is in any URL to
    be traversed is case sensitive, and VMS converts startpage
    to all lowercase if no double-quotes are supplied.

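The effect of the missing double-quotes can be illustrated with a short
Python sketch. It assumes the check is a plain case-sensitive prefix
comparison, which this announcement implies but does not spell out. The
name in_scope and the /SAF/ path are hypothetical and are not taken from
the Lynx sources.

```python
def in_scope(url: str, startpage: str) -> bool:
    """Return True if url would be followed, assuming a case-sensitive prefix test."""
    return url.startswith(startpage)


# Hypothetical start page containing uppercase characters.
startpage = "http://seqaxp.bio.caltech.edu:8000/SAF/"
link = "http://seqaxp.bio.caltech.edu:8000/SAF/index.html"

# Quoted on VMS: the startpage keeps its case and the link is followed.
print(in_scope(link, startpage))

# Unquoted on VMS: the startpage arrives in lowercase and the link is skipped.
print(in_scope(link, startpage.lower()))
```

Under this assumption an unquoted startpage with uppercase characters
would match none of its own links, so the traversal would stop after the
first page.
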
Files created and/or used with the -traversal switch, based on definitions
in userdefs.h:

    TRAVERSE_FILE (traverse.dat):
        Contains a list of all URLs that were traversed. Note
        that if a URL appears in this file it will not be
        traversed again (important if runs are started and
        stopped). Placing an entry in this file BEFORE the
        run will block traversal of that URL. Unlike in
        reject.dat, a final * has no effect (see below).

    TRAVERSE_FOUND_FILE (traverse2.dat):
        Contains a list of all URLs in the order traversed. A
        URL may be present in this list many times. To simplify
        the list, on VMS use:
            sort/nodups traverse2.dat;1 ;2

    TRAVERSE_REJECT_FILE (reject.dat):
        Contains a list of URLs that have been rejected from the
        traversal. Once a URL has been entered in this list, it
        will not be traversed. URLs that end in a * will cause
        rejection of all URLs that match up to the character before
        the *. So for instance, to reject all htbin references on a
        site, put this line in the reject.dat file BEFORE starting
        the run:
            http://www.site.wherever:8000/htbin*

    TRAVERSE_ERRORS (traverse.errors):
        A list of links that evoked mailings to the document
        owner if MAIL_SYSTEM_ERROR_LOGGING was defined in
        userdefs.h (not recommended!!!).

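The difference between the two blocking files can be sketched in Python.
This is an illustration of the rules as described above, not the Lynx
code itself. The function names are hypothetical, and two points are
assumptions: that entries without a trailing * must match the URL
exactly, and that a trailing * in traverse.dat is treated as an ordinary
character.

```python
def load_entries(lines):
    """Strip whitespace and skip blank lines from a traverse.dat/reject.dat listing."""
    return [line.strip() for line in lines if line.strip()]


def is_rejected(url, reject_entries):
    """reject.dat rule: a trailing * rejects every URL sharing the text before it."""
    for entry in reject_entries:
        if entry.endswith("*"):
            if url.startswith(entry[:-1]):
                return True
        elif url == entry:  # assumption: exact match when there is no *
            return True
    return False


def already_traversed(url, traversed_entries):
    """traverse.dat rule: exact match only; a trailing * is not a wildcard."""
    return url in traversed_entries


reject = load_entries(["http://www.site.wherever:8000/htbin*\n"])
done = load_entries(["http://www.site.wherever:8000/docs*\n"])

# Blocked by the wildcard entry in reject.dat.
print(is_rejected("http://www.site.wherever:8000/htbin/query", reject))
# Not covered by any reject.dat entry.
print(is_rejected("http://www.site.wherever:8000/index.html", reject))
# The same kind of entry in traverse.dat blocks nothing under /docs.
print(already_traversed("http://www.site.wherever:8000/docs/a.html", done))
```
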
Files created during traversals if the -crawl switch is included with the
-traversal switch:

    lnk########.dat
        Numbered output files containing the contents of traversed
        hypertext documents in text format. All hypertext links
        within the document have been stripped, and the URL and
        TITLE of the document are recorded as the first two lines,
        e.g., for the seqaxp.bio.caltech.edu home page the first
        two lines will be:

            THE_URL:http://seqaxp.bio.caltech.edu:8000/
            THE_TITLE:SAF Web server home page

        The VMSIndex software is being adapted to use this
        information to extract the corresponding URL and TITLE
        for use in indexing the lnk########.dat files, e.g.:

            $ build_index -
                /url=(text="THE_URL:") -
                /topic=(text="THE_TITLE:",EXCLUDE) -
                /output=INDEX_NAME -
                lnk*.dat

        A clever person should be able to figure out a way to
        index the lnk########.dat files on Unix as well.

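As one possible starting point on Unix, the following Python sketch reads
the two header lines from each lnk*.dat file and collects them into a
simple lookup table. It is not a replacement for VMSIndex and does no
full-text indexing. The function names, the latin-1 encoding and the
sample file name lnk00000001.dat are assumptions made for illustration.

```python
import glob
import os
import tempfile

URL_TAG = "THE_URL:"
TITLE_TAG = "THE_TITLE:"


def read_header(path):
    """Return (url, title) from the first two lines of an lnk########.dat file."""
    with open(path, encoding="latin-1") as handle:  # encoding is an assumption
        first = handle.readline().rstrip("\n")
        second = handle.readline().rstrip("\n")
    if not first.startswith(URL_TAG) or not second.startswith(TITLE_TAG):
        raise ValueError(f"{path}: missing THE_URL/THE_TITLE header")
    return first[len(URL_TAG):], second[len(TITLE_TAG):]


def build_index(directory):
    """Map each lnk*.dat file name in directory to its (url, title) pair."""
    index = {}
    for path in sorted(glob.glob(os.path.join(directory, "lnk*.dat"))):
        index[os.path.basename(path)] = read_header(path)
    return index


# Demonstration with one made-up output file in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    sample = os.path.join(workdir, "lnk00000001.dat")
    with open(sample, "w", encoding="latin-1") as handle:
        handle.write("THE_URL:http://seqaxp.bio.caltech.edu:8000/\n")
        handle.write("THE_TITLE:SAF Web server home page\n")
        handle.write("Body text of the page...\n")
    demo_index = build_index(workdir)
    print(demo_index)
```

The resulting table could be passed to whatever indexing tool is
available, with the remaining lines of each file serving as the text to
be indexed.
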
This functionality is still under development. Feedback and suggestions
are welcome.